Abstract

The effect of errors in variables in quantization is investigated. We prove general exact
and non-exact oracle inequalities with fast rates for an empirical minimization based on a
noisy sample $Z_i = X_i + \epsilon_i$, $i = 1, \ldots, n$, where the $X_i$ are i.i.d. with density $f$ and the $\epsilon_i$ are i.i.d.
with density $\eta$. These rates depend on the geometry of the density $f$ and the asymptotic
behaviour of the characteristic function of $\eta$.

This general study can be applied to the problem of k-means clustering with noisy
data. For this purpose, we introduce a deconvolution k-means stochastic minimization
which reaches fast rates of convergence under standard Pollard's regularity assumptions.

Keywords: Quantization, Deconvolution, Fast rates, Margin assumption, k-means clustering.

1. Introduction 



The goal of empirical vector quantization (Graf and Luschgy (2000)) or clustering (Hartigan
(1975)) is to replace data by an efficient and compact representation, which allows one to
reconstruct the original observations with a certain accuracy. The problem originated
in signal processing and has many applications in cluster analysis and information theory.
The statistical model can be described as follows. Given independent and identically
distributed (i.i.d.) random variables $X_1, \ldots, X_n$, with unknown law $P$ with density $f$ on $\mathbb{R}^d$
with respect to the Lebesgue measure, we want to choose a quantizer (or classifier) $g \in \mathcal{G}$,
where $\mathcal{G}$ is the set of all possible quantizers (or classifiers). The accuracy of
$g$ is measured by a distortion or risk given by, for some loss function $\ell$:

\[
R(g) = \mathbb{E}_P \ell(g, X) = \int_{\mathbb{R}^d} \ell(g, x) f(x)\, dx. \tag{1}
\]

The most investigated example of such a framework is probably cluster analysis, where
given some integer $k \geq 2$, we want to build $k$ clusters of the set of observations $X_1, \ldots, X_n$.
In this framework, a classifier $g \in \mathcal{G}$ assigns cluster $g(x) \in \{1, \ldots, k\}$ to an observation
$x \in \mathbb{R}^d$.

However, in many real-life situations, direct data $X_1, \ldots, X_n$ are not available and
measurement errors occur. Then, we observe only a corrupted sample $Z_i = X_i + \epsilon_i$, $i = 1, \ldots, n$,
with noisy distribution $\tilde{P}$, where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d., independent of $X_1, \ldots, X_n$, with density
$\eta$. The problem of noisy empirical vector quantization or noisy clustering is to represent
compactly and efficiently the measure $P$ when a contaminated empirical version $Z_1, \ldots, Z_n$
is observed. This problem is a particular case of inverse statistical learning (see Loustau
(2012)) and is known to be an inverse problem. To the best of our knowledge, it has not yet been
considered in the literature. This paper tries to fill this gap by giving a theoretical study of
this problem. The construction of an algorithm to deal with clustering from a noisy dataset
will be the core of a future paper.

A quite natural approach in statistical learning is to embed clustering or empirical vector
quantization into the general and extensively studied problem of empirical risk minimization
(see Vapnik (2000), Bartlett and Mendelson (2006), Koltchinskii (2006)). This is exactly the
guiding thread of this contribution. For this purpose, given a class $\mathcal{G}$ of classifiers or quantizers
(possibly an infinite-dimensional space), let us consider a loss function $\ell : \mathcal{G} \times \mathbb{R}^d \to \mathbb{R}$, where
$\ell(g, x)$ measures the loss of $g$ at point $x$. In such a framework, given data $X_1, \ldots, X_n$, it is
standard to consider an empirical risk minimizer (ERM) defined as:

\[
\hat{g}_n \in \arg\min_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \ell(g, X_i). \tag{2}
\]

Since the pioneering work of Vapnik, many authors have investigated the statistical performance
of (2) in such generality. We describe below two possible examples that fall into
the specific problem of clustering or empirical quantization.



Example 1 (The k-means clustering problem) The finite dimensional clustering problem
deals with the construction of a vector $c = (c_1, \ldots, c_k) \in \mathbb{R}^{dk}$ to represent efficiently
with $k \geq 1$ centers a set of observations $X_1, \ldots, X_n \in \mathbb{R}^d$. For this purpose, it is standard
to consider the loss function $\gamma : \mathbb{R}^{dk} \times \mathbb{R}^d \to \mathbb{R}$ defined as:

\[
\gamma(c, x) = \min_{j=1,\ldots,k} \|x - c_j\|^2.
\]

In this case, the empirical risk minimizer is given by

\[
\hat{c}_n = \arg\min_{c} \frac{1}{n} \sum_{i=1}^n \min_{j=1,\ldots,k} \|X_i - c_j\|^2
\]

and is known as the popular k-means (Pollard (1981), Pollard (1982)).
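As a small illustration of Example 1, the loss $\gamma(c, x) = \min_j \|x - c_j\|^2$ and the empirical risk it induces can be sketched as follows. The function names `kmeans_loss` and `empirical_risk` and the toy data are hypothetical choices made for this sketch; they do not come from the original text.

```python
import numpy as np

def kmeans_loss(centers, x):
    """gamma(c, x): squared distance from x (shape (d,)) to its nearest center."""
    sq_dists = np.sum((centers - x) ** 2, axis=1)  # one value per center
    return float(np.min(sq_dists))

def empirical_risk(centers, X):
    """Average of gamma(c, X_i) over the sample X (shape (n, d))."""
    # Pairwise squared distances between points and centers, shape (n, k).
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return float(np.mean(np.min(sq_dists, axis=1)))

# Toy sample: two groups of two points each, and one center per group.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
centers = np.array([[0.0, 0.5], [4.0, 0.5]])
risk = empirical_risk(centers, X)
```

Here every point lies at distance 0.5 from its nearest center, so the empirical risk is 0.25. Minimizing `empirical_risk` over `centers` is the k-means problem itself, which this sketch does not attempt to solve.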



Example 2 (Learning principal curves) Another possible example is to consider quantization
with principal curves (see Biau and Fischer (2012)). In the definition of Kégl et al.
(2000), a principal curve can be defined as the minimizer of the least-square distortion:

\[
W(g) = \mathbb{E}_P \inf_{t} \|X - g(t)\|^2
\]

over a collection of parameterized curves $g : t \mapsto (g_1(t), \ldots, g_d(t))$. Principal curves can be
useful in a wide range of statistical learning or data mining problems, such as speech recognition,
social sciences or geology (see Biau and Fischer (2012) and the references therein). As
in (2), we can minimize the empirical least-square distortion $W_n(g)$, namely the distortion
integrated with respect to the empirical measure.
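In this paper, we propose to adopt a comparable strategy in the presence of noisy
measurements. Since we observe a corrupted sample $Z_i = X_i + \epsilon_i$, $i = 1, \ldots, n$, the empirical
risk minimization (2) is not available. However, we can introduce a deconvolution step in
the estimation procedure by constructing a kernel deconvolution estimator of the density $f$
of the form:

\[
\hat{f}_\lambda(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{Z_i - x}{\lambda}\right), \tag{3}
\]

where $\mathcal{K}_\eta$ is a deconvolution kernel and $\lambda = (\lambda_1, \ldots, \lambda_d) \in \mathbb{R}_+^d$ is a regularization parameter
(see Section 2 for details). With a slight abuse of notation, we write in (3), for any
$x = (x_1, \ldots, x_d)$, $Z_i = (Z_{1,i}, \ldots, Z_{d,i}) \in \mathbb{R}^d$:

\[
\frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{Z_i - x}{\lambda}\right) = \frac{1}{\prod_{j=1}^d \lambda_j} \mathcal{K}_\eta\left(\frac{Z_{1,i} - x_1}{\lambda_1}, \ldots, \frac{Z_{d,i} - x_d}{\lambda_d}\right).
\]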

Given this estimator, we construct an empirical risk by plugging (3) into the true risk (1)
to get a so-called deconvolution empirical risk minimization. The idea originated in
Loustau and Marteau (2012) for discriminant analysis. To fix some notation, in this paper,
a solution of this stochastic minimization is written:

\[
\hat{g}_n^\lambda \in \arg\min_{g \in \mathcal{G}} R_n^\lambda(g), \quad \text{where } R_n^\lambda(g) = \frac{1}{n} \sum_{i=1}^n \ell_\lambda(g, Z_i). \tag{4}
\]

Section 2 is devoted to the detailed construction of the deconvolution empirical risk $R_n^\lambda(\cdot)$
through the loss $\ell_\lambda(g, \cdot)$.

The purpose of this work is to study the statistical performance of $\hat{g}_n^\lambda$ in (4) in terms
of oracle inequalities. On the one hand, we study the theoretical performance of $\hat{g}_n^\lambda$ through
exact oracle inequalities. An exact oracle inequality states that with high probability:

\[
R(\hat{g}_n^\lambda) \leq \inf_{g \in \mathcal{G}} R(g) + r_{n,f,\eta}(\mathcal{G}), \tag{5}
\]

where $r_{n,f,\eta}(\mathcal{G}) \to 0$ as $n \to \infty$. The residual term $r_{n,f,\eta}(\mathcal{G})$ is called the rate of convergence.
It is a function of the complexity of $\mathcal{G}$, the behaviour of the density $f$, and the density
$\eta$ of the noise. In this paper, the behaviour of $f$ depends on two different assumptions: a
margin assumption and a regularity assumption. The margin assumption is related to the
difficulty of the problem whereas the regularity assumption will be expressed in terms of
anisotropic Hölder spaces.

On the other hand, we propose non-exact oracle inequalities, i.e. the existence of a constant
$\epsilon > 0$, such that with high probability:

\[
R(\hat{g}_n^\lambda) \leq (1 + \epsilon) \inf_{g \in \mathcal{G}} R(g) + r^*_{n,f,\eta}(\mathcal{G}). \tag{6}
\]

The main difference between (5) and (6) resides in the residual terms which appear in the right
hand sides (RHS). As in Lecué and Mendelson (2012), one of the messages of this paper is
to highlight the presence of faster rates of convergence (i.e. $r^*_{n,f,\eta}(\mathcal{G}) = o(r_{n,f,\eta}(\mathcal{G}))$ as $n \to \infty$) for
non-exact oracle inequalities. The cornerstone of these results is a bias-variance
decomposition of the risk $R(\hat{g}_n^\lambda)$ as in Loustau (2012). However, in comparison to Loustau
(2012), this work extends the previous results to unsupervised learning, non-exact oracle
inequalities and to an anisotropic class of densities $f$.

The paper is organized as follows. In Section 2, we present the method and the main
assumptions on the density $\eta$ (noise assumption), the kernel in (3) and the density $f$ (regularity
and margin assumptions). We state the main theoretical results in Section 3, which
consist of exact and non-exact oracle inequalities with fast rates of convergence. They allow us
to recover recent results in the area of fast rates. These results are applied in Section 4 to
the problem of finite dimensional clustering with k-means. Section 5 concludes the paper
with a discussion whereas Sections 6 and 7 give detailed proofs of the main results.



2. Deconvolution ERM 

2.1 Construction of the estimator 



The deconvolution ERM introduced in this paper is originally due to Loustau and Marteau
(2012) in discriminant analysis (see also Loustau (2012) for such a generality in supervised
classification). The main idea of the construction is to estimate the true risk (1) through
a deconvolution kernel as follows.



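Let us introduce $\mathcal{K} = \prod_{j=1}^d \mathcal{K}_j : \mathbb{R}^d \to \mathbb{R}$, a $d$-dimensional function defined as the product
of $d$ unidimensional functions $\mathcal{K}_j$. Besides, $\mathcal{K}$ (and also $\eta$) belongs to $L_2(\mathbb{R}^d)$ and admits a
Fourier transform. Then, if we denote by $\lambda = (\lambda_1, \ldots, \lambda_d)$ a set of (positive) bandwidths
and by $\mathcal{F}[\cdot]$ the Fourier transform, we define $\mathcal{K}_\eta$ as:

\[
\mathcal{K}_\eta : \mathbb{R}^d \to \mathbb{R}, \quad t \mapsto \mathcal{K}_\eta(t) = \mathcal{F}^{-1}\left[\frac{\mathcal{F}[\mathcal{K}](\cdot)}{\mathcal{F}[\eta](\cdot/\lambda)}\right](t). \tag{7}
\]

Given this deconvolution kernel, we construct an empirical risk by plugging (3) into the
true risk $R(g)$ to get a so-called deconvolution empirical risk given by:

\[
R_n^\lambda(g) = \frac{1}{n} \sum_{i=1}^n \ell_\lambda(g, Z_i), \quad \text{where } \ell_\lambda(g, Z_i) = \int_K \ell(g, x) \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{Z_i - x}{\lambda}\right) dx. \tag{8}
\]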
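To make (7) concrete, the following sketch evaluates a deconvolution kernel numerically in dimension $d = 1$. Everything specific here is an assumption made for illustration and is not taken from the paper: the Fourier convention $\mathcal{F}[h](t) = \int h(x) e^{itx} dx$, the sinc kernel (for which $\mathcal{F}[\mathcal{K}] = 1$ on $[-1, 1]$ and $0$ elsewhere), centered Laplace noise with scale `b` (characteristic function $1/(1 + b^2 t^2)$), and the function name `deconvolution_kernel`. Under these choices, (7) reduces to $\mathcal{K}_\eta(u) = \frac{1}{\pi} \int_0^1 \cos(tu) (1 + b^2 t^2 / \lambda^2)\, dt$.

```python
import numpy as np

def deconvolution_kernel(u, lam, b, n_grid=4001):
    """1-D deconvolution kernel K_eta(u) for a sinc kernel and Laplace(b) noise.

    Assumed setting (illustrative only): F[K](t) = 1 on [-1, 1] and
    F[eta](t) = 1 / (1 + b^2 t^2), so that
    K_eta(u) = (1/pi) * int_0^1 cos(t u) * (1 + (b t / lam)^2) dt.
    """
    u = np.atleast_1d(np.asarray(u, dtype=float))
    t = np.linspace(0.0, 1.0, n_grid)
    inv_noise = 1.0 + (b * t / lam) ** 2             # 1 / F[eta](t / lam)
    integrand = np.cos(np.outer(u, t)) * inv_noise   # shape (len(u), n_grid)
    # Composite trapezoidal rule on the uniform grid t.
    dt = t[1] - t[0]
    integral = dt * (integrand.sum(axis=1)
                     - 0.5 * (integrand[:, 0] + integrand[:, -1]))
    return integral / np.pi

# Without noise (b = 0) the sinc kernel sin(u) / (pi u) is recovered.
no_noise = deconvolution_kernel(2.0, lam=1.0, b=0.0)[0]
# With noise, a smaller bandwidth inflates the kernel.
with_noise = deconvolution_kernel(0.0, lam=0.5, b=1.0)[0]
```

The factor $1 + b^2 t^2 / \lambda^2$ grows like $\lambda^{-2}$ as the bandwidth shrinks, which is the price paid for inverting a noise whose characteristic function decays polynomially.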



Note that for technical reasons, we restrict ourselves to a compact set $K \subset \mathbb{R}^d$ and study the
risk minimization (1) only in $K$. Consequently, in this paper, we only provide a control of
the true risk (1) restricted to $K$, namely the truncated risk:

\[
R_K(g) = \int_K \ell(g, x) f(x)\, dx.
\]

This restriction has been considered in Mammen and Tsybakov (1999) (or more recently in
Loustau and Marteau (2012)). It is important to note that when $f$ has compact support,
$R_K(g) = R(g)$ for $K$ large enough. In the sequel, for simplicity,
we write $R(\cdot)$ for the restricted risk defined above. The choice of $K$ is discussed in Section
3 and depends on the context.
2.2 Assumptions

For the sake of simplicity, we restrict ourselves to moderately or mildly ill-posed inverse
problems as follows. We introduce the following noise assumption (NA):

(NA): There exist $(\beta_1, \ldots, \beta_d)' \in \mathbb{R}_+^d$ such that:

\[
|\mathcal{F}[\eta](t)| \sim \prod_{i=1}^d |t_i|^{-\beta_i} \quad \text{as } |t_i| \to +\infty, \ \forall i \in \{1, \ldots, d\}.
\]

Moreover, we assume that $\mathcal{F}[\eta](t) \neq 0$ for all $t = (t_1, \ldots, t_d) \in \mathbb{R}^d$.

Assumption (NA) deals with the asymptotic behaviour of the characteristic function
of the noise distribution. This kind of restriction is standard in deconvolution problems
for $d = 1$ (see Fan (1991); Meister (2009); Butucea (2007)). In this contribution, we
only deal with $d$-dimensional mildly ill-posed deconvolution problems, which correspond
to a polynomial decay of $\mathcal{F}[\eta]$ in each direction. For the sake of brevity, we do not
consider severely ill-posed inverse problems (exponential decay) or possible intermediate cases
(e.g. a combination of polynomial and exponential decay). Recently,
Comte and Lacour (2012) proposed such a study in the context of multivariate deconvolution.
In our framework, the rates in these cases could be obtained through the same steps.
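As an illustration of (NA), chosen here as an example and not taken from the text, consider centered Laplace noise with scale $b > 0$ in dimension $d = 1$. Reading $\sim$ as equivalence up to a multiplicative constant, it satisfies (NA) with $\beta = 2$:

```latex
\[
\eta(x) = \frac{1}{2b} e^{-|x|/b}, \qquad
\mathcal{F}[\eta](t) = \frac{1}{1 + b^2 t^2} \sim b^{-2} |t|^{-2}
\quad \text{as } |t| \to +\infty,
\]
```

and $\mathcal{F}[\eta](t) \neq 0$ for every $t \in \mathbb{R}$. By contrast, Gaussian noise has a characteristic function $e^{-\sigma^2 t^2 / 2}$ with exponential decay, which falls in the severely ill-posed case excluded above.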
We also require the following assumptions on the kernel $\mathcal{K}$.

(K1) There exists $S = (S_1, \ldots, S_d) \in \mathbb{R}_+^d$, $K_1 > 0$ such that the kernel $\mathcal{K}$ satisfies

\[
\mathrm{supp}\, \mathcal{F}[\mathcal{K}] \subset [-S, S] \quad \text{and} \quad \sup_{t \in \mathbb{R}^d} |\mathcal{F}[\mathcal{K}](t)| \leq K_1,
\]

where $\mathrm{supp}\, g = \{x : g(x) \neq 0\}$ and $[-S, S] = \bigotimes_{i=1}^d [-S_i, S_i]$.

This assumption is trivially satisfied by several standard kernels, such as the sinc kernel.
It is needed for technical reasons in the proofs and can be relaxed using a finer
algebra. Moreover, in the sequel, we consider a kernel of order $m$, for a particular $m \in \mathbb{N}^d$.

K(m) The kernel $\mathcal{K}$ is of order $m = (m_1, \ldots, m_d) \in \mathbb{N}^d$, i.e.

• $\int_{\mathbb{R}^d} \mathcal{K}(x)\, dx = 1$,

• $\int_{\mathbb{R}^d} \mathcal{K}(x) x_j^k\, dx = 0$, $\forall\, 1 \leq k \leq m_j$, $\forall j \in \{1, \ldots, d\}$,

• $\int_{\mathbb{R}^d} |\mathcal{K}(x)| |x_j|^{m_j}\, dx < K_2$, $\forall j \in \{1, \ldots, d\}$.

The construction of kernels satisfying K(m) can be carried out as in Tsybakov (2004a).
This property is standard in nonparametric kernel estimation and yields satisfactory
approximations under the following assumption on the regularity of the density $f$.

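Definition 1 For some $s = (s_1, \ldots, s_d) \in \mathbb{R}_+^d$, $L > 0$, we say that $f$ belongs to the
anisotropic Hölder space $\mathcal{H}(s, L)$ if the following holds:

• the function $f$ admits derivatives with respect to $x_j$ up to order $\lfloor s_j \rfloor$, where $\lfloor s_j \rfloor$
denotes the largest integer less than $s_j$;

• $\forall j = 1, \ldots, d$, $\forall x \in \mathbb{R}^d$, $\forall x_j' \in \mathbb{R}$, the following Lipschitz condition holds:

\[
\left| \frac{\partial^{\lfloor s_j \rfloor}}{(\partial x_j)^{\lfloor s_j \rfloor}} f(x_1, \ldots, x_{j-1}, x_j', x_{j+1}, \ldots, x_d) - \frac{\partial^{\lfloor s_j \rfloor}}{(\partial x_j)^{\lfloor s_j \rfloor}} f(x) \right| \leq L |x_j' - x_j|^{s_j - \lfloor s_j \rfloor}.
\]

If a function $f$ belongs to the anisotropic Hölder space $\mathcal{H}(s, L)$, $f$ has Hölder regularity
$s_j$ in each direction $j = 1, \ldots, d$. As a result, it can be well approximated pointwise using
a $d$-dimensional Taylor formula.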



3. Main results 

It is well known that the behaviour of the rates of convergence $r_{n,f,\eta}(\mathcal{G})$ in (5) or $r^*_{n,f,\eta}(\mathcal{G})$
in (6) is governed by the size of $\mathcal{G}$. In this paper, the size of the hypothesis space will be
quantified in terms of $\epsilon$-entropy with bracketing of the metric space $(\{\ell(g), g \in \mathcal{G}\}, L_2)$ as
follows.

Definition 2 Given a metric space $(\mathcal{F}, d)$ and a real number $\epsilon > 0$, the $\epsilon$-entropy with
bracketing of $(\mathcal{F}, d)$ is the quantity $\mathcal{H}_B(\mathcal{F}, \epsilon, d)$ defined as the logarithm of the minimal
integer $N_B(\epsilon)$ such that there exist pairs $(f_j, g_j) \in \mathcal{F} \times \mathcal{F}$, $j = 1, \ldots, N_B(\epsilon)$, such that
$f_j \leq g_j$, $d(f_j, g_j) \leq \epsilon$, and such that for any $f \in \mathcal{F}$, there exists a pair $(f_j, g_j)$ such that
$f_j \leq f \leq g_j$.

This notion of complexity allows us to obtain local uniform concentration inequalities (see
van de Geer (2000) or van der Vaart and Wellner (1996)). Indeed, to reach fast rates
of convergence (i.e. faster than $n^{-1/2}$), what really matters is not the total size of the
hypothesis space but rather the size of a subclass of $\mathcal{G}$, made of functions with small
errors. In this paper, we use an iterative localization principle originally introduced in
Koltchinskii and Panchenko (2000) (see also Koltchinskii (2006) for such a generality). More
precisely, to state exact oracle inequalities, we consider functions in $\mathcal{G}$ with small excess risk
as follows:

\[
\mathcal{G}(\delta) = \left\{g \in \mathcal{G} : R(g) - \inf_{g' \in \mathcal{G}} R(g') \leq \delta\right\},
\]

whereas to get non-exact oracle inequalities, we consider the following set:

\[
\mathcal{G}'(\delta) = \{g \in \mathcal{G} : R(g) \leq \delta\}.
\]



Originally, Mammen and Tsybakov (1999) (see also Tsybakov (2004b)) formulated a
useful condition to get fast rates of convergence in classification in the exact case. This
assumption is known as the margin assumption and has been generalized by Bartlett and Mendelson
(2006). Coarsely speaking, a margin assumption guarantees a nice relationship between the
variance and the expectation of any function of the excess loss class. In this contribution,
it appears as follows:

Margin Assumption MA($\kappa$) There exists some $\kappa \geq 1$ such that:

\[
\forall g \in \mathcal{G}, \quad \|\ell(g, \cdot) - \ell(g^*(g), \cdot)\|_{L_2}^2 \leq \kappa_0 \left( R(g) - \inf_{g' \in \mathcal{G}} R(g') \right)^{1/\kappa},
\]

for some $\kappa_0 > 0$ and where $g^*(g) \in \arg\min_{h \in \mathcal{G}} R(h)$ can depend on $g$ when $|\mathcal{G}(0)| \geq 2$.



Combined with a local concentration inequality (see Theorem 17 in Section 6) applied
to the class $\mathcal{G}(\delta)$, this margin assumption is used in the exact case to get fast rates. Note
that provided that $\ell(g, \cdot)$ is bounded, MA($\kappa$) implies MA($\kappa'$) for any $\kappa' \geq \kappa$. Interestingly,
in the framework of finite dimensional clustering with k-means, Levrard (2012) proposes
a sufficient condition to have MA($\kappa$) with $\kappa = 1$. This condition is related to
the geometry of $f$ with respect to the optimal clusters and gives well-separated classes. It
allows us to interpret MA($\kappa$) exactly as a margin assumption in clustering (see Section 5). In
the sequel, we call the parameter $\kappa$ in MA($\kappa$) the margin parameter.

Recently, Lecué and Mendelson (2012) pointed out that one could obtain non-exact oracle
inequalities with fast rates under a weaker assumption. The idea is to relax significantly
the margin assumption and use the loss class $\{\ell(g), g \in \mathcal{G}\}$ in MA($\kappa$) instead of the excess
loss class $\{\ell(g) - \ell(g^*), g \in \mathcal{G}\}$. This framework will be considered at the end of this section
for completeness. It leads to non-exact oracle inequalities in the noisy case.



3.1 Exact Oracle inequalities 

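We are now in a position to state the main exact oracle inequality.

Theorem 3 (Exact Oracle Inequality) Suppose (NA), (K1), and MA($\kappa$) hold for
some margin parameter $\kappa \geq 1$. Suppose $f \in \mathcal{H}(s, L)$ and K(m) holds with $m = \lfloor s \rfloor$.
Suppose there exist $0 < \rho < 1$, $c > 0$ such that for every $\epsilon > 0$:

\[
\mathcal{H}_B(\{\ell(g), g \in \mathcal{G}\}, \epsilon, L_2) \leq c \epsilon^{-2\rho}. \tag{9}
\]

Then, for any $t > 0$, there exists some $n_0(t) \in \mathbb{N}^*$ such that for any $n \geq n_0(t)$, with
probability greater than $1 - e^{-t}$, the deconvolution ERM $\hat{g}_n^\lambda$ is such that:

\[
R(\hat{g}_n^\lambda) \leq \inf_{g \in \mathcal{G}} R(g) + C n^{-\tau_d(\kappa, \rho, \beta, s)},
\]

where $C > 0$ is independent of $n$ and $\tau_d(\kappa, \rho, \beta, s)$ is given by:

\[
\tau_d(\kappa, \rho, \beta, s) = \frac{\kappa}{2\kappa + \rho - 1 + (2\kappa - 1) \sum_{j=1}^d \beta_j / s_j},
\]

and $\lambda = (\lambda_1, \ldots, \lambda_d)$ is chosen as:

\[
\lambda_j \approx n^{-\frac{2\kappa - 1}{2\kappa s_j} \tau_d(\kappa, \rho, \beta, s)}, \quad \forall j = 1, \ldots, d.
\]

The proof of this result is postponed to Section 6. We list some remarks below.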
Remark 4 (Comparison with Koltchinskii (2006) or Mammen and Tsybakov (1999))
This result gives the order of the residual term in the exact oracle inequalities. The risk of
the estimator $\hat{g}_n^\lambda$ mimics the risk of the oracle, up to a residual term detailed in Theorem
3. The price to pay for the errors-in-variables model depends on the asymptotic behaviour of
the characteristic function of the noise distribution. If $\beta = 0 \in \mathbb{R}^d$ in the noise assumption
(NA), the residual term in Theorem 3 satisfies:

\[
r_n(\mathcal{G}) = O\left(n^{-\frac{\kappa}{2\kappa + \rho - 1}}\right).
\]

It corresponds to the standard fast rates stated in Koltchinskii (2006) or Mammen and Tsybakov (1999)
for the direct case.



Remark 5 (Comparison with Loustau (2012)) In comparison with Loustau (2012),
these rates deal with an anisotropic behaviour of the density $f$. If $s_j = s$ for any direction,
we obtain the same asymptotics as in Loustau (2012) for supervised classification, namely:

\[
r_n(\mathcal{G}) = O\left(n^{-\frac{\kappa s}{s(2\kappa + \rho - 1) + (2\kappa - 1) \sum_{j=1}^d \beta_j}}\right).
\]

The result of Theorem 3 gives a generalization of Loustau (2012) to the anisotropic case,
in an unsupervised framework. It gives some intuition with respect to the optimality of this
result.

Remark 6 (The anisotropic case is of practical interest) The result of Theorem 3
gives some insights into the noisy quantization problem with an anisotropic density $f$. In
this problem, due to the anisotropic behaviour of the density, the choice of the regularization
parameters $\lambda_j$, $j = 1, \ldots, d$ depends on $j$. This result is of practical interest since it allows us to
consider different bandwidth coordinates for the deconvolution ERM. In finite dimensional
noisy clustering with $k \geq 2$, this configuration arises when the optimal centers are not
uniformly distributed over the support of the density. This case could not be treated, at least
from a theoretical point of view, using the previous isotropic approach stated in Loustau (2012)
or Loustau and Marteau (2012).

Remark 7 (Fast rates) The most favorable cases arise when $\rho \to 0$ and $\beta$ is small,
whereas at the same time the density $f$ has sufficiently high Hölder exponents $s_j$. Indeed, fast
rates occur when $\tau_d(\kappa, \rho, \beta, s) > 1/2$, or equivalently, $(2\kappa - 1) \sum_{j=1}^d \beta_j / s_j < 1 - \rho$. If $\rho = 0$
and $\kappa = 1$ (see the particular case of Section 4), we have the following condition to get fast
rates:

\[
\sum_{j=1}^d \frac{\beta_j}{s_j} < 1.
\]

Remark 8 (Choice of $\lambda$) The optimal choice of $\lambda$ in Theorem 3 optimizes a bias-variance
decomposition as in Loustau (2012). This choice depends on unknown parameters such as
the margin parameter $\kappa$, the Hölder exponents $(s_1, \ldots, s_d)$ of the density $f$ and the degree
of ill-posedness $\beta$. A challenging open problem is to derive an adaptive choice of $\lambda$ leading to
the same fast rates of convergence. This could be the purpose of future work.
Remark 9 (Comparison with Comte and Lacour (2012)) It is also important to note
that the optimal choice of the multivariate bandwidth $\lambda$ does not coincide with the optimal
choice of the bandwidth in standard nonparametric anisotropic density deconvolution.
Indeed, it is stated in Comte and Lacour (2012) that under the same regularity and
ill-posedness assumptions, the optimal choice of the bandwidth $\lambda = (\lambda_1, \ldots, \lambda_d)$ has the following
asymptotics:



\[
\lambda_j \sim n^{-\frac{1}{s_j \left(2 + \sum_{k=1}^d (2\beta_k + 1)/s_k\right)}}.
\]

The proposed asymptotic optimal calibration of Theorem 3 is rather different. It depends
explicitly on the parameter $\rho$, which measures the complexity of the decision set $\mathcal{G}$, and the
margin parameter $\kappa \geq 1$. It shows that our bandwidth selection problem is not
equivalent to standard nonparametric estimation problems. It illustrates once more that
our procedure is not a plug-in procedure.

3.2 Non-exact oracle inequalities 

In this section, we also suggest a non-exact version of Theorem 3 without the margin
assumption MA($\kappa$). However, to get this result, we need an additional assumption about the
compact set $K$ appearing in the empirical risk (8). The assumption has the following form:

Density assumption DA($c_0$) There exists a constant $c_0 > 0$ such that the compact
set $K$ in (8) satisfies:

\[
K \subset \{x : f(x) \geq c_0\}.
\]

This assumption is trivially satisfied if $f > 0$ in $\mathbb{R}^d$ with a constant $c_0$ depending on the
size of $K$. Assumption DA($c_0$) is necessary to get fast rates in the context of non-exact
oracle inequalities without the margin assumption MA($\kappa$). We are now in a position to state
the following result.

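Theorem 10 (Non-Exact Oracle Inequality) Suppose (NA), DA($c_0$) and (K1) hold
for some constant $c_0 > 0$. Suppose $f \in \mathcal{H}(s, L)$ and K(m) holds with $m = \lfloor s \rfloor$. Suppose
there exist $0 < \rho < 1$, $c > 0$ such that for every $\epsilon > 0$:

\[
\mathcal{H}_B(\{\ell(g), g \in \mathcal{G}\}, \epsilon, L_2) \leq c \epsilon^{-2\rho}.
\]

Then, for any $t > 0$, there exists some $n_0(t) \in \mathbb{N}^*$ such that for any $\epsilon > 0$, for any $n \geq n_0(t)$,
with probability higher than $1 - e^{-t}$, $\hat{g}_n^\lambda$ satisfies:

\[
R(\hat{g}_n^\lambda) \leq (1 + \epsilon) \inf_{g \in \mathcal{G}} R(g) + C n^{-\tau^*(\rho, \beta, s)},
\]

where $C > 0$ is a constant which depends on $\epsilon$, $\beta$, $s$, $\rho$, $c_0$ and

\[
\tau^*(\rho, \beta, s) = \frac{1}{1 + \rho + \sum_{j=1}^d \beta_j / s_j},
\]

whereas $\lambda = (\lambda_1, \ldots, \lambda_d)$ is chosen as:

\[
\lambda_j \approx n^{-\frac{\tau^*(\rho, \beta, s)}{2 s_j}}, \quad \forall j = 1, \ldots, d.
\]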
Remark 11 (Same phenomenon as in Lecué and Mendelson (2012)) The quantity
$\tau^*(\rho, \beta, s)$ describes the order of the residual term in Theorem 10. We can see that
$\tau^*(\rho, \beta, s) = \tau_d(1, \rho, \beta, s)$, where $\tau_d(1, \rho, \beta, s)$ appears in Theorem 3. As a result, this oracle
inequality gives the same asymptotics as the previous result under MA($\kappa$) with $\kappa = 1$, which
corresponds to the strong margin assumption. Here, it holds without any margin assumption.
The price to pay is the constant in front of the infimum. This phenomenon has
already been pointed out in Lecué and Mendelson (2012) in a supervised framework and in the
direct case. Of course, the constant $C > 0$ in front of the rate depends on $\epsilon > 0$ and explodes
when $\epsilon$ tends to $0$ (see the corresponding condition in the proof).
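The identity between the two exponents can be checked in one line, assuming they read $\tau_d(\kappa, \rho, \beta, s) = \kappa / (2\kappa + \rho - 1 + (2\kappa - 1) \sum_j \beta_j / s_j)$ in Theorem 3 and $\tau^*(\rho, \beta, s) = 1 / (1 + \rho + \sum_j \beta_j / s_j)$ in Theorem 10:

```latex
\[
\tau_d(1, \rho, \beta, s)
= \frac{1}{2 + \rho - 1 + (2 - 1) \sum_{j=1}^d \beta_j / s_j}
= \frac{1}{1 + \rho + \sum_{j=1}^d \beta_j / s_j}
= \tau^*(\rho, \beta, s).
\]
```

For $\kappa > 1$ the exponent $\tau_d(\kappa, \rho, \beta, s)$ is strictly smaller than this value, so the non-exact inequality always reaches the rate of the most favorable margin case.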



Remark 12 (The density assumption) Unfortunately, Theorem 10 requires an additional
assumption in comparison to Theorem 3, namely the assumption DA($c_0$). This
assumption is specific to the indirect framework where we need to control the variance of
the convoluted loss $\ell_\lambda(g, Z)$ with respect to the variance of $\ell(g, X)$. More precisely, we need
the following inequality (in dimension $d = 1$ for simplicity):

\[
\mathbb{E}_{\tilde{P}}\, \ell_\lambda(g, Z)^2 \leq \lambda^{-2\beta}\, \mathbb{E}_P\, \ell(g, X)^2, \quad \forall g \in \mathcal{G}.
\]

This can be done only if we restrict $\ell_\lambda(\cdot)$ to a region where $f > 0$. Otherwise, there is no
reason to obtain such a control (see Lemma 23 and also the related discussion in Loustau
(2012)).



4. Application to finite dimensional noisy clustering 

The aim of this section is to use the general upper bound of Theorem 3 in the framework
of noisy finite dimensional clustering. To frame the problem of finite dimensional clustering
into the general study of this paper, we first introduce the following notation. Given some
known integer $k \geq 2$, let us consider $c = (c_1, \ldots, c_k) \in \mathcal{C}$ the set of possible centers, where
$\mathcal{C} \subset \mathbb{R}^{dk}$ is compact. The loss function $\gamma : \mathbb{R}^{dk} \times \mathbb{R}^d \to \mathbb{R}$ is defined as:

\[
\gamma(c, x) = \min_{j=1,\ldots,k} \|x - c_j\|^2,
\]

where $\| \cdot \|$ stands for the standard Euclidean norm on $\mathbb{R}^d$. The corresponding true risk or
clustering risk is given by $R(c) = \mathbb{E}_P \gamma(c, X)$. In the sequel, we introduce a constant $M \geq 0$
such that $\|X\|_\infty \leq M$. This boundedness assumption ensures that $\gamma(c, X)$ is bounded.
The performance of the empirical minimizer $\hat{c}_n = \arg\min_{c \in \mathcal{C}} P_n \gamma(c)$ (also called the k-means
clustering algorithm) has been widely studied in the literature. Consistency was shown
by Pollard (1981) when $\mathbb{E}\|X\|^2 < \infty$, whereas Linder et al. (1994) or Biau et al. (2008)
give rates of convergence of the form $O(1/\sqrt{n})$ for the excess clustering risk defined as
$R(\hat{c}_n) - R(c^*)$, where $c^* \in \mathcal{M}$, the set of all possible optimal clusters. More recently, Levrard
(2012) proposed fast rates of the form $O(1/n)$ under Pollard's regularity assumptions. It
improves a previous result of Antos et al. (2005). The main ingredient of the proof is a
localization argument in the spirit of Blanchard et al. (2008).

In this section, we study the problem of clustering where we have at our disposal a
corrupted sample $Z_i = X_i + \epsilon_i$, $i = 1, \ldots, n$, where the $\epsilon_i$'s are i.i.d. with density $\eta$ satisfying
(NA) of Section 2. For this purpose, we introduce the following deconvolution empirical
risk minimization:

\[
\hat{c}_n^\lambda \in \arg\min_{c \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^n \gamma_\lambda(c, Z_i), \tag{10}
\]

where $\gamma_\lambda(c, z)$ is a deconvolution k-means loss defined as:

\[
\gamma_\lambda(c, z) = \int_K \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{z - x}{\lambda}\right) \min_{j=1,\ldots,k} \|x - c_j\|^2\, dx.
\]

The kernel $\mathcal{K}_\eta$ is the deconvolution kernel introduced in Section 2 with $\lambda = (\lambda_1, \ldots, \lambda_d) \in \mathbb{R}_+^d$
a set of positive bandwidths chosen later on. We investigate the generalization ability of
the solution of (10) in the context of Pollard's regularity assumptions. For this purpose, we
will use the following regularity assumptions on the source distribution $P$.
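The deconvolution k-means loss can be approximated numerically in dimension $d = 1$. The sketch below is illustrative only and rests on assumptions that are not in the paper: a sinc kernel with $\mathcal{F}[\mathcal{K}] = 1$ on $[-1, 1]$, centered Laplace noise with scale `b` (characteristic function $1/(1 + b^2 t^2)$), the compact set $K = [-M, M]$, trapezoidal quadrature, and the hypothetical function names `deconvolution_kernel`, `deconv_kmeans_loss` and `deconv_empirical_risk`.

```python
import numpy as np

def deconvolution_kernel(u, lam, b, n_t=401):
    """1-D K_eta(u) = (1/pi) * int_0^1 cos(t u) * (1 + (b t / lam)^2) dt (assumed setting)."""
    t = np.linspace(0.0, 1.0, n_t)
    inv_noise = 1.0 + (b * t / lam) ** 2            # 1 / F[eta](t / lam)
    vals = np.cos(np.outer(u, t)) * inv_noise       # shape (len(u), n_t)
    dt = t[1] - t[0]
    integral = dt * (vals.sum(axis=1) - 0.5 * (vals[:, 0] + vals[:, -1]))
    return integral / np.pi

def deconv_kmeans_loss(centers, z, lam, b, M=3.0, n_x=1201):
    """gamma_lambda(c, z): integral over [-M, M] of
    (1/lam) * K_eta((z - x) / lam) * min_j (x - c_j)^2, by the trapezoidal rule."""
    centers = np.asarray(centers, dtype=float)
    x = np.linspace(-M, M, n_x)
    weights = deconvolution_kernel((z - x) / lam, lam, b) / lam
    loss = np.min((x[:, None] - centers[None, :]) ** 2, axis=1)
    vals = weights * loss
    dx = x[1] - x[0]
    return float(dx * (vals.sum() - 0.5 * (vals[0] + vals[-1])))

def deconv_empirical_risk(centers, Z, lam, b):
    """Average of gamma_lambda(c, Z_i) over the noisy sample Z."""
    return float(np.mean([deconv_kmeans_loss(centers, z, lam, b) for z in Z]))

# Symmetric centers: the loss takes the same value at z and -z.
loss_pos = deconv_kmeans_loss([-1.0, 1.0], 0.7, lam=0.5, b=0.3)
loss_neg = deconv_kmeans_loss([-1.0, 1.0], -0.7, lam=0.5, b=0.3)
```

Because the assumed kernel is even and both $K$ and the centers are symmetric around the origin, `loss_pos` and `loss_neg` agree up to rounding. Minimizing `deconv_empirical_risk` over the centers would give a candidate solution of (10); the sketch only evaluates the criterion.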

Pollard's Regularity Condition (PRC): The distribution $P$ satisfies the following two
conditions:

1. $P$ has a continuous density $f$ with respect to the Lebesgue measure on $\mathbb{R}^d$,

2. The Hessian matrix of $c \mapsto P\gamma(c, \cdot)$ is positive definite for all optimal vectors of
clusters $c^*$.

It is easy to see that, using the compactness of $\mathcal{B}(0, M)$, $\|X\|_\infty \leq M$ and (PRC) ensure
that there exists only a finite number of optimal clusters $c^* \in \mathcal{M}$. This number is denoted
by $|\mathcal{M}|$ in the rest of this section. Moreover, Pollard's conditions can be related to the
margin assumption MA($\kappa$) of Section 3 thanks to the following lemma due to Antos et al.
(2005).



Lemma 13 (Antos et al. (2005)) Suppose $\|X\|_\infty \leq M$ and (PRC) holds. Then, for
any $c \in \mathcal{B}(0, M)$:

\[
\|\gamma(c, \cdot) - \gamma(c^*(c), \cdot)\|_{L_2}^2 \leq C_1 \|c - c^*(c)\|^2 \leq C_1 C_2 \left(R(c) - R(c^*(c))\right),
\]

where $c^*(c) \in \arg\min_{c^*} \|c - c^*\|$.

Lemma 13 ensures a margin assumption MA($\kappa$) with $\kappa = 1$ (see Section 3). It is useful
to derive fast rates of convergence. Recently, Levrard (2012) has pointed out sufficient
conditions to have (PRC) as follows. Denote by $\partial V_i$ the boundary of the Voronoi cell $V_i$
associated with $c_i$, for $i = 1, \ldots, k$. Then, a sufficient condition to have (PRC) is to
control the sup-norm of $f$ on the union of all possible $|\mathcal{M}|$ boundaries $\partial V^{*,m} = \cup_{i=1}^k \partial V_i^{*,m}$,
associated with the optimal clusters $c^{*,m} \in \mathcal{M}$, as follows:

\[
\left\| f_{|\cup_{m=1}^{|\mathcal{M}|} \partial V^{*,m}} \right\|_\infty \leq c(d) M^{d+1} \inf_{m=1,\ldots,|\mathcal{M}|,\ i=1,\ldots,k} P(V_i^{*,m}),
\]

where $c(d)$ is a constant depending on the dimension $d$. As a result, the margin assumption is
guaranteed when the source distribution $P$ is well concentrated around its optimal clusters,
which is related to well-separated classes. From this point of view, the margin assumption
MA($\kappa$) can be related to the margin assumption in binary classification.
We are now ready to state the main result of this section.
Theorem 14 Assume (NA) holds, $P$ satisfies (PRC) with density $f \in \mathcal{H}(s, L)$ and
$\mathbb{E}\|\epsilon\|^2 < \infty$. Then, for any $t > 0$, for any $n \geq n_0(t)$, denoting by $\hat{c}_n^\lambda$ a solution of (10), we
have with probability higher than $1 - e^{-t}$:

\[
R(\hat{c}_n^\lambda) \leq \inf_{c \in \mathcal{C}} R(c) + C \sqrt{\log\log(n)}\, n^{-\frac{1}{1 + \sum_{j=1}^d \beta_j / s_j}},
\]

where $C > 0$ is independent of $n$ and $\lambda = (\lambda_1, \ldots, \lambda_d)$ is chosen as:

\[
\lambda_j \approx n^{-\frac{1}{2 s_j \left(1 + \sum_{i=1}^d \beta_i / s_i\right)}}, \quad \forall j = 1, \ldots, d.
\]

The proof is postponed to Section 7. Here follow some remarks.



Remark 15 (Fast rates of convergence) Theorem 14 is a direct application of Theorem
3 in Section 3. The order of the residual term in Theorem 14 is comparable to Theorem
3. Due to the finite dimensional hypothesis space $\mathcal{C} \subset \mathbb{R}^{dk}$, we apply the previous study
to the case $\rho = 0$. It leads to the fast rates $O\left(n^{-1/(1 + \sum_{j=1}^d \beta_j / s_j)}\right)$, up to an extra $\sqrt{\log\log n}$
term. This term is due to the localization principle of the proof, which consists in applying
iteratively the concentration inequality of Theorem 17. In the finite dimensional case, when
$\rho = 0$, we pay an extra $\sqrt{\log\log n}$ term in the rate by solving the fixed point equation. Note
that using for instance Levrard (2012), this term can be avoided. It is out of the scope of the
present paper.
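As a worked illustration (a case chosen for this example, not one treated in the text), take a noise with $\beta_j = 2$ in every direction, as for independent Laplace coordinates, and an isotropic density with $s_j = s$ for all $j$. The exponent $1 / (1 + \sum_j \beta_j / s_j)$ of Theorem 14 then becomes:

```latex
\[
\frac{1}{1 + \sum_{j=1}^d \beta_j / s_j}
= \frac{1}{1 + 2d/s}
= \frac{s}{s + 2d},
\]
```

which exceeds $1/2$ exactly when $s > 2d$. In this configuration, the smoothness required to beat the rate $n^{-1/2}$ grows linearly with the dimension.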

Remark 16 (Optimality) Lower bounds of the form $O(1/\sqrt{n})$ have been stated in the
direct case by Bartlett et al. for general distributions. An open problem is to derive
lower bounds in the context of Theorem 14. For this purpose, we need to construct configurations
where both Pollard's regularity assumption and the noise assumption (NA) could be
used in a careful way. In this direction, Loustau and Marteau (2012) suggest lower bounds
in a supervised framework under both the margin assumption and (NA).
5. Conclusion 

This paper can be seen as a first attempt at studying quantization with errors-in-variables.
Many problems could be considered in future work, from a theoretical or practical
point of view.

In the problem of risk minimization with noisy data, we provide oracle inequalities for an
empirical risk minimization based on a deconvolution kernel. The risk of the deconvolution
ERM mimics the risk of the oracle, up to some residual term, called the rate of convergence.
The order of these rates depends on the complexity of the hypothesis space in terms of
entropy, the behaviour of the density $f$ and the degree of ill-posedness. From the theoretical
point of view, these results extend the previous study of Loustau (2012) to the unsupervised
framework, the non-exact case and to an anisotropic behaviour of the density $f$. These
significant extensions could be the core of many applications in unsupervised learning.

As an example, we turn to the problem of clustering with k-means. We consider the
general approach and introduce a deconvolution kernel estimator of the density $f$ in the
distortion. It gives rise to a new stochastic minimization called deconvolution k-means. The
method gives fast rates of convergence.

Another possible direct application of the results of this paper is to learn principal curves
in the presence of noisy observations. In such a problem, the aim is to design a principal
curve for an unknown distribution P when we have at our disposal a noisy dataset
$Z_i = X_i + \epsilon_i$, $i = 1, \ldots, n$. To the best of our knowledge, this problem has
not been considered in the literature. Following the ERM approach of this paper, it is possible
to design a new procedure and to state rates of convergence in the presence of noisy observations.

The general deconvolution ERM principle introduced in this paper can be used to design
new algorithms to deal with unsupervised statistical learning with noisy observations. As
a first step, the construction of a noisy version of the well-known k-means algorithm is the
core of future work. The construction of a noisy version of the Polygonal Line Algorithm (see
Sandilya and Kulkarni (2002)) could also be investigated, to deal with learning principal
curves from indirect observations.



6. Proofs 

The main probabilistic tool for our needs is the following concentration inequality due to 
Bousquet. 



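Theorem 17 (Bousquet (2002)) Let $\mathcal{G}$ be a countable class of real-valued measurable
functions defined on a measurable space $\mathcal{X}$. Let $X_1, \ldots, X_n$ be $n$ i.i.d. random
variables with values in $\mathcal{X}$. Let us consider the random variable:

$$Z_n(\mathcal{G}) = \sup_{g\in\mathcal{G}}\left|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g(X_1)\right|.$$

Then, for every $t > 0$:

$$\mathbb{P}\left(Z_n(\mathcal{G}) \geq U_n(\mathcal{G},t)\right) \leq e^{-t},$$

where:

$$U_n(\mathcal{G},t) = \mathbb{E}Z_n(\mathcal{G}) + \sqrt{\frac{2t}{n}\left[\sigma^2(\mathcal{G}) + (1 + b(\mathcal{G}))\,\mathbb{E}Z_n(\mathcal{G})\right]} + \frac{t}{3n},$$

and

$$\sigma^2(\mathcal{G}) = \sup_{g\in\mathcal{G}}\mathbb{E}g(X_1)^2 \quad\text{and}\quad b(\mathcal{G}) = \sup_{g\in\mathcal{G}}\|g\|_\infty.$$
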


The proof of this result uses the so-called entropy method introduced by Ledoux (1996), and
further refined by Massart (2000) or [13 (l200d'). The use of a -version (see for instance
Adamczak (2008)) has been considered in Lecué and Mendelson (2012), to alleviate the
boundedness assumption.

This concentration inequality is at the core of the localization principle presented in
Koltchinskii (2006), which consists in applying Theorem 17 to functions in $\mathcal{G}$ with
small error. In the following, we extend this localization approach to:

• the noisy set-up,
• the non-exact case.

For this purpose, we apply Theorem 17 to particular classes $\mathcal{G}$, namely excess loss
classes for the exact case and loss classes for the non-exact case. These two extensions are
proposed in Lemma 18 and Lemma 19 below. These results are at the core of the general exact and
non-exact oracle inequalities of Theorem 3 and Theorem 10 in Section 3.



6.1 Intermediate lemmas 
6.1.1 Notations 

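Let us first introduce the following notations. For any fixed $g \in \mathcal{G}$, we write:

$$R^\lambda(g) = \int_K \ell(g,x)\,\mathbb{E}_P\frac{1}{\lambda}\mathcal{K}_\eta\left(\frac{Z-x}{\lambda}\right)dx \quad\text{and}\quad R_n^\lambda(g) = \frac{1}{n}\sum_{i=1}^n \ell_\lambda(g,Z_i).$$

As a result, for any fixed $g \in \mathcal{G}$, we have the following equality:

$$R_n^\lambda(g) - R^\lambda(g) = \frac{1}{n}\sum_{i=1}^n \ell_\lambda(g,Z_i) - \mathbb{E}_P\ell_\lambda(g,Z).$$

With a slight abuse of notation, we also denote:

$$(R_n^\lambda - R^\lambda)(g - g') = R_n^\lambda(g) - R^\lambda(g) - R_n^\lambda(g') + R^\lambda(g').$$

The same notation is used for $R^\lambda(\cdot)$ and $R(\cdot)$ with the quantity $(R - R^\lambda)(g - g')$.
For a function $\psi : \mathbb{R}_+ \to \mathbb{R}_+$, the following transformations will be considered:

$$\psi^\flat(\delta) = \sup_{\sigma\geq\delta}\frac{\psi(\sigma)}{\sigma} \quad\text{and}\quad \psi^\dagger(\epsilon) = \inf\{\delta > 0 : \psi^\flat(\delta) \leq \epsilon\}.$$

Moreover, we need the following property (see Koltchinskii (2006)):

$$\forall \delta' \leq \delta,\quad \psi(\delta) \leq \delta\,\psi^\flat(\delta'). \qquad (11)$$

We are also interested in the following discretized version of these transformations:

$$\psi_q^\flat(\delta) = \sup_{\delta_j\geq\delta}\frac{\psi(\delta_j)}{\delta_j} \quad\text{and}\quad \psi_q^\dagger(\epsilon) = \inf\{\delta > 0 : \psi_q^\flat(\delta) \leq \epsilon\},$$

where for some $q > 1$, $\delta_j = q^{-j}$ for $j \in \mathbb{N}^*$.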
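
As an illustration of the discretized transformations, the following sketch evaluates them on the geometric grid $\delta_j = q^{-j}$, truncated at a finite index. The truncation level `j_max` and the function names are assumptions made for the example only. The second function returns the smallest grid point at which the constraint holds, which overestimates the infimum in the definition by at most a factor $q$.

```python
import math


def flat_transform(psi, delta, q, j_max):
    """Discretized flat transform: sup of psi(d_j)/d_j over grid points d_j = q**-j >= delta.

    Only indices j = 1, ..., j_max are scanned; the sup over an empty set is -inf.
    """
    ratios = [psi(q ** -j) / q ** -j for j in range(1, j_max + 1) if q ** -j >= delta]
    return max(ratios) if ratios else float("-inf")


def dagger_transform(psi, eps, q, j_max):
    """Smallest grid point d_j (j <= j_max) whose flat transform is at most eps, or None."""
    best = None
    running_sup = float("-inf")
    for j in range(1, j_max + 1):
        d_j = q ** -j
        # running_sup equals the flat transform at d_j and is nondecreasing in j.
        running_sup = max(running_sup, psi(d_j) / d_j)
        if running_sup > eps:
            break
        best = d_j
    return best


if __name__ == "__main__":
    psi = lambda d: 0.1 * math.sqrt(d)  # illustrative choice of psi
    print(flat_transform(psi, 0.06, 2, 20))
    print(dagger_transform(psi, 0.5, 2, 20))
```

For $\psi(\delta) = A\sqrt{\delta}$ the ratio $\psi(\delta)/\delta$ decreases in $\delta$, so the supremum is attained at the smallest grid point above $\delta$, and the second transform scales like $(A/\epsilon)^2$.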

Finally, in the sequel, $K, C > 0$ denote generic constants that may vary from
line to line.

6.1.2 Exact case 

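The proof of Theorem 3 uses the following intermediate lemma.

Lemma 18 (Exact case) Suppose there exists some function $a : \lambda \mapsto a(\lambda)$ and a
constant $0 < r < 1$ such that:

$$(R - R^\lambda)(g - g^*(g)) \leq a(\lambda) + r\,(R(g) - R(g^*(g))), \qquad (12)$$

where $g^*(g) \in \arg\min_{g'\in\mathcal{G}} R(g')$ can depend on $g$.
Then, for any $q > 1$, $\forall \delta \geq \bar\delta_\lambda(t)$, we have:

$$\mathbb{P}\Big(R(\hat g_n^\lambda) \geq \inf_{g\in\mathcal{G}} R(g) + \delta\Big) \leq \log_q\Big(\frac{1}{\delta}\Big)e^{-t},$$

where:

$$\bar\delta_\lambda(t) = \max\Big(\delta_\lambda(t),\ \frac{8q}{1-r}\,a(\lambda)\Big),$$

for $\delta_\lambda(t) = (U_\lambda(\cdot,t))^\dagger\big(\frac{1-r}{4q}\big)$ and where we define, for some constant $K > 0$:

$$U_\lambda(\delta,t) := K\left[\mathbb{E}Z_\lambda(\delta) + \sqrt{\frac{t}{n}}\,\sigma_\lambda(\delta) + \sqrt{\frac{t}{n}\,(1 + 2b_\lambda(\delta))\,\mathbb{E}Z_\lambda(\delta)} + \frac{t}{3n}\right],$$

where

$$Z_\lambda(\delta) := \sup_{g,g'\in\mathcal{G}(\delta)}\left|(R_n^\lambda - R^\lambda)(g - g')\right|,$$

$$\sigma_\lambda(\delta) := \sup_{g,g'\in\mathcal{G}(\delta)}\sqrt{\mathbb{E}_P(\ell_\lambda(g,Z) - \ell_\lambda(g',Z))^2},$$

$$b_\lambda(\delta) := \sup_{g\in\mathcal{G}(\delta)}\|\ell_\lambda(g,\cdot)\|_\infty.$$
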



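Proof The proof follows Koltchinskii (2006), extended to the noisy set-up.
Given $q > 1$, we introduce a sequence of positive numbers:

$$\delta_j = q^{-j},\quad \forall j \geq 1.$$

Given $n, j \geq 1$, $t > 0$ and $\lambda \in \mathbb{R}_+^d$, consider the event:

$$E_{\lambda,j}(t) = \{Z_\lambda(\delta_j) \leq U_\lambda(\delta_j,t)\}.$$

Then, we have, using Theorem 17, for some $K > 0$, $\mathbb{P}(E_{\lambda,j}(t)^C) \leq e^{-t}$, $\forall t > 0$.
We restrict ourselves to the event $E_{\lambda,j}(t)$.

Let $\epsilon < c\,\delta_{j+1}$ where $c > 0$ is chosen later on. Then, consider some $g \in \mathcal{G}(\epsilon)$, where:

$$\mathcal{G}(\epsilon) = \{g \in \mathcal{G} : R(g) - \inf_{g\in\mathcal{G}} R(g) \leq \epsilon\}.$$

Using assumption (12) and the definition of $\hat g := \hat g_n^\lambda$, one has:

$$R(\hat g) - \inf_{g\in\mathcal{G}} R(g) \leq R(\hat g) - R(g) + \epsilon$$
$$\leq (R - R^\lambda)(\hat g - g) + (R^\lambda - R_n^\lambda)(\hat g - g) + \epsilon$$
$$\leq (R^\lambda - R_n^\lambda)(\hat g - g) + 2a(\lambda) + r\big(R(\hat g) - \inf_{g\in\mathcal{G}} R(g)\big) + r\big(R(g) - \inf_{g\in\mathcal{G}} R(g)\big) + \epsilon.$$

Hence, we have the following assertion:

$$\delta_{j+1} \leq R(\hat g) - \inf_{g\in\mathcal{G}} R(g) \leq \delta_j \ \Longrightarrow\ \delta_{j+1} \leq \frac{1}{1-r}\Big((R^\lambda - R_n^\lambda)(\hat g - g) + 2a(\lambda) + (1+r)\epsilon\Big).$$

On the event $E_{\lambda,j}(t)$, it follows that $\forall \delta \leq \delta_j$:

$$\delta_{j+1} \leq R(\hat g) - \inf_{g\in\mathcal{G}} R(g) \leq \delta_j \ \Longrightarrow\ \delta_{j+1} \leq \frac{1}{1-r}\Big(U_\lambda(\delta_j,t) + 2a(\lambda) + (1+r)\epsilon\Big) \leq \frac{1}{1-r}\Big(\delta_j V_\lambda(\delta,t) + 2a(\lambda) + (1+r)\epsilon\Big),$$

where $V_\lambda(\delta,t) = U_\lambda^\flat(\delta,t)$ satisfies property (11). We obtain, for any $\delta \leq \delta_j$:

$$V_\lambda(\delta,t) \geq \frac{1-r}{q} - \frac{1}{\delta_j}\Big(2a(\lambda) + (1+r)\epsilon\Big).$$

The assumption $a(\lambda) \leq (1-r)\delta/8q$ and the choice of $c = \frac{1-r}{4(1+r)}$ in the beginning of the
proof give the following lower bound:

$$V_\lambda(\delta,t) \geq \frac{1-r}{2q}.$$

It follows from the definition of the $\dagger$-transform that $\delta \leq \delta_\lambda(t)$.
Hence, we have on the event $E_{\lambda,j}(t)$, for any $\delta \leq \delta_j$:

$$\delta_{j+1} \leq R(\hat g) - \inf_{g\in\mathcal{G}} R(g) \leq \delta_j \ \Longrightarrow\ \delta \leq \delta_\lambda(t),$$

or equivalently,

$$\delta_\lambda(t) < \delta \leq \delta_j \ \Longrightarrow\ \hat g \notin \mathcal{G}(\delta_{j+1},\delta_j),$$

where $\mathcal{G}(c,C) = \{g \in \mathcal{G} : c \leq R(g) - \inf_{g\in\mathcal{G}} R(g) \leq C\}$. We eventually obtain:

$$\bigcap_{\delta_j\geq\delta} E_{\lambda,j}(t) \ \text{and}\ \delta \geq \bar\delta_\lambda(t) \ \Longrightarrow\ R(\hat g) - \inf_{g\in\mathcal{G}} R(g) \leq \delta.$$

This formulation allows us to write, by the union bound:

$$\mathbb{P}\Big(R(\hat g) \geq \inf_{g\in\mathcal{G}} R(g) + \delta\Big) \leq \sum_{\delta_j\geq\delta}\mathbb{P}\big(E_{\lambda,j}(t)^C\big) \leq \log_q\Big(\frac{1}{\delta}\Big)e^{-t},$$

since $\{j : \delta_j \geq \delta\} = \{j : j \leq -\frac{\log\delta}{\log q}\}$.
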
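The counting step behind the union bound can be checked numerically. The sketch below lists the indices $j \geq 1$ with $q^{-j} \geq \delta$ and multiplies their number by $e^{-t}$; the function names are hypothetical and the values of $\delta$, $q$ and $t$ are arbitrary.

```python
import math


def active_scales(delta, q):
    """Indices j >= 1 with q**-j >= delta, i.e. the scales entering the union bound."""
    scales = []
    j = 1
    while q ** -j >= delta:
        scales.append(j)
        j += 1
    return scales


def union_bound(delta, q, t):
    """(Number of active scales) * exp(-t), an upper bound on the failure probability."""
    return len(active_scales(delta, q)) * math.exp(-t)


if __name__ == "__main__":
    delta, q, t = 0.1, 2, 2.0
    print(active_scales(delta, q))
    print(union_bound(delta, q, t), math.log(1 / delta, q) * math.exp(-t))
```

The number of active scales never exceeds $\log_q(1/\delta)$, which is where the logarithmic factor in the lemma comes from.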
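6.1.3 The non-exact case

The proof of Theorem 10 uses the following version of Lemma 18.

Lemma 19 (Non-exact case) Suppose there exists $a^*(\cdot,\cdot) : (r,\lambda) \in (0,1)\times\mathbb{R}_+^d \mapsto a^*(r,\lambda)$
such that for any $(r,\lambda) \in (0,1)\times\mathbb{R}_+^d$:

$$\forall g \in \mathcal{G},\quad R(g) - R^\lambda(g) \leq a^*(r,\lambda) + rR(g). \qquad (13)$$

Then, for any $q > 1$, $a \in (0,1)$, $u \in (0,1/q)$, $\delta \geq \bar\delta'_\lambda(t)$:

$$\mathbb{P}\big(R(\hat g_n^\lambda) \geq \delta\big) \leq \log_q\Big(\frac{1}{\delta}\Big)e^{-t},$$

where:

$$\bar\delta'_\lambda(t) = \max\Big(\delta'_\lambda(t),\ \frac{2}{(1-r)au}\,a^*(r,\lambda),\ \frac{1+r}{(1-r)(1-a)u}\inf_{g\in\mathcal{G}} R(g)\Big),$$

for

$$\delta'_\lambda(t) = (U'_\lambda(\cdot,t))^\dagger\Big(\frac{(1-r)(1-uq)}{2q}\Big),$$

and where we define, for some constant $K > 0$:

$$U'_\lambda(\delta,t) := K\left[\mathbb{E}Z'_\lambda(\delta) + \sqrt{\frac{t}{n}}\,\sigma'_\lambda(\delta) + \sqrt{\frac{t}{n}\,(1 + 2b'_\lambda(\delta))\,\mathbb{E}Z'_\lambda(\delta)} + \frac{t}{3n}\right],$$

where here, we write for $\mathcal{G}'(\delta) = \{g \in \mathcal{G} : R(g) \leq \delta\}$:

$$Z'_\lambda(\delta) := \sup_{g\in\mathcal{G}'(\delta)}\left|(R_n^\lambda - R^\lambda)(g)\right|,$$

$$\sigma'_\lambda(\delta) := \sup_{g\in\mathcal{G}'(\delta)}\sqrt{\mathbb{E}_P(\ell_\lambda(g,Z))^2},$$

$$b'_\lambda(\delta) := \sup_{g\in\mathcal{G}'(\delta)}\|\ell_\lambda(g,\cdot)\|_\infty.$$
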

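Proof The proof follows the proof of Lemma 18, applied to the non-exact case. Given
$q > 1$, we introduce a sequence of positive numbers:

$$\delta_j = q^{-j},\quad \forall j \geq 1.$$

Given $n, j \geq 1$, $t > 0$ and $\lambda \in \mathbb{R}_+^d$, consider the event:

$$E'_{\lambda,j}(t) = \{Z'_\lambda(\delta_j) \leq U'_\lambda(\delta_j,t)\}.$$

Then, we have, using Theorem 17, $\mathbb{P}(E'_{\lambda,j}(t)^C) \leq e^{-t}$.
We restrict ourselves to the event $E'_{\lambda,j}(t)$.

Using assumption (13), we have, for any $g \in \mathcal{G}$ and any $r \in (0,1)$:

$$R(\hat g) \leq \frac{1}{1-r}\Big((R^\lambda - R_n^\lambda)(\hat g) + a^*(r,\lambda) + R_n^\lambda(g)\Big),$$

where we use the definition of $\hat g = \hat g_n^\lambda$. Moreover, note that, using again assumption (13):

$$R_n^\lambda(g) = (R_n^\lambda - R^\lambda)(g) + (R^\lambda - R)(g) + R(g) \leq (R_n^\lambda - R^\lambda)(g) + a^*(r,\lambda) + (1+r)R(g).$$

Then, we have, for $g = g^* \in \arg\min_{\mathcal{G}} R(g)$:

$$R(\hat g) \leq \frac{1}{1-r}\Big((R_n^\lambda - R^\lambda)(g^* - \hat g) + 2a^*(r,\lambda) + (1+r)\inf_{g\in\mathcal{G}} R(g)\Big).$$

We hence have on the event $E'_{\lambda,j}(t)$:

$$\delta_{j+1} \leq R(\hat g) \leq \delta_j \ \Longrightarrow\ \delta_{j+1} \leq \frac{1}{1-r}\Big(2U'_\lambda(\delta_j,t) + 2a^*(r,\lambda) + (1+r)\inf_{g\in\mathcal{G}} R(g)\Big),$$

since in this case $R(g^*) \leq \delta_j$. On the event $E'_{\lambda,j}(t)$, it follows that $\forall \delta \leq \delta_j$:

$$\delta_{j+1} \leq R(\hat g) \leq \delta_j \ \Longrightarrow\ \delta_{j+1} \leq \frac{1}{1-r}\Big(2\delta_j V'_\lambda(\delta,t) + 2a^*(r,\lambda) + (1+r)\inf_{g\in\mathcal{G}} R(g)\Big),$$

where $V'_\lambda(\delta,t) = (U'_\lambda(\cdot,t))^\flat(\delta)$ is defined as above. We obtain, for any $u \in (0,1/q)$:

$$V'_\lambda(\delta,t) \geq \frac{1-r}{2q} - \frac{1}{2\delta_j}\Big(2a^*(r,\lambda) + (1+r)\inf_{g\in\mathcal{G}} R(g)\Big) \geq \frac{1-r}{2q} - \frac{1-r}{2}\,u, \qquad (14)$$

provided that for any $a \in (0,1)$, since $\delta \leq \delta_j$:

$$a^*(r,\lambda) \leq a\,\frac{(1-r)u}{2}\,\delta \quad\text{and}\quad \inf_{g\in\mathcal{G}} R(g) \leq (1-a)\,\frac{(1-r)u}{1+r}\,\delta.$$

From (14), on the event $E'_{\lambda,j}(t)$, for any $\delta \leq \delta_j$ satisfying these two conditions:

$$\delta_{j+1} \leq R(\hat g) \leq \delta_j \ \Longrightarrow\ \delta \leq \delta'_\lambda(t),$$

or equivalently, by definition of $\bar\delta'_\lambda(t)$:

$$\bar\delta'_\lambda(t) \leq \delta \leq \delta_j \ \Longrightarrow\ \hat g \notin \mathcal{G}'(\delta_{j+1},\delta_j),$$

where here $\mathcal{G}'(c,C) = \{g \in \mathcal{G} : c \leq R(g) \leq C\}$. We eventually obtain:

$$\bigcap_{\delta_j\geq\delta} E'_{\lambda,j}(t) \ \text{and}\ \delta \geq \bar\delta'_\lambda(t) \ \Longrightarrow\ R(\hat g) \leq \delta.$$

This formulation allows us to write, by the union bound, exactly as in the proof of Lemma 18:

$$\mathbb{P}\big(R(\hat g) \geq \delta\big) \leq \sum_{\delta_j\geq\delta}\mathbb{P}\big(E'_{\lambda,j}(t)^C\big) \leq \log_q\Big(\frac{1}{\delta}\Big)e^{-t}, \qquad (15)$$

where $\delta \geq \bar\delta'_\lambda(t)$.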
6.2 Proof of Theorem 3 and Theorem 10
6.2.1 Proof of Theorem 3

The proof of Theorem 3 is divided into two steps. Using Lemma 18, we obtain an exact oracle
inequality when $|\mathcal{G}(0)| = 1$. For the general case, we introduce a more sophisticated
localization explained in Koltchinskii (2006, Section 4). Moreover, we begin the proof in
dimension $d = 1$ for simplicity. A slightly different algebra is given at the end of the
proof to lead to the general case.
Case 1: $|\mathcal{G}(0)| = 1$.

When $|\mathcal{G}(0)| = 1$, it is important to note that MA($\kappa$) holds with a minimizer $g^* \in \mathcal{G}$ which
does not depend on $g$. Then, we can write, for any $g, g' \in \mathcal{G}(\delta)$:

$$\|\ell(g) - \ell(g')\|_{L_2} \leq \|\ell(g) - \ell(g^*)\|_{L_2} + \|\ell(g') - \ell(g^*)\|_{L_2} \leq 2\sqrt{\kappa_0}\,\delta^{1/2\kappa}.$$

Combining this with the entropy condition (9), we obtain:



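$$\mathbb{E}\sup_{g,g'\in\mathcal{G}(\delta)}\left|(R_n^\lambda - R^\lambda)(g - g')\right| \leq \mathbb{E}\sup_{\|\ell(g)-\ell(g')\|_{L_2}\leq 2\sqrt{\kappa_0}\,\delta^{1/2\kappa}}\left|(R_n^\lambda - R^\lambda)(g - g')\right| \leq C\,\frac{\lambda^{-\beta}}{\sqrt{n}}\,\delta^{\frac{1-\rho}{2\kappa}},$$

where we use in the last line Lemma 1 in Loustau (2012). Then, using the notations of Lemma 18:

$$U_\lambda(\delta,t) = K\left[\mathbb{E}Z_\lambda(\delta) + \sqrt{\frac{t}{n}}\,\sigma_\lambda(\delta) + \sqrt{\frac{t}{n}\,(1 + 2b_\lambda(\delta))\,\mathbb{E}Z_\lambda(\delta)} + \frac{t}{3n}\right]$$
$$\leq K\left[\frac{\lambda^{-\beta}}{\sqrt{n}}\,\delta^{\frac{1-\rho}{2\kappa}} + \sqrt{\frac{t}{n}}\,\sigma_\lambda(\delta) + \sqrt{\frac{t}{n}\,(1 + 2b_\lambda(\delta))\,\frac{\lambda^{-\beta}}{\sqrt{n}}\,\delta^{\frac{1-\rho}{2\kappa}}} + \frac{t}{3n}\right].$$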



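It remains to control the $L_2(P)$-diameter $\sigma_\lambda(\delta)$ and the term $b_\lambda(\delta)$ thanks to Lemma 20.
Using again assumption MA($\kappa$) and the uniqueness of the minimizer $g^*$, together with the
first assertion of Lemma 20, we can write:

$$\sigma_\lambda(\delta) = \sup_{g,g'\in\mathcal{G}(\delta)}\sqrt{\mathbb{E}_P(\ell_\lambda(g,Z) - \ell_\lambda(g',Z))^2} \leq C\lambda^{-\beta}\delta^{\frac{1}{2\kappa}}.$$

Now, by the second assertion of Lemma 20:

$$b_\lambda(\delta) = \sup_{g\in\mathcal{G}(\delta)}\|\ell_\lambda(g,\cdot)\|_\infty \leq C\lambda^{-\beta-1/2}.$$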



It follows that: 

Ux{5,t)<K 
We hence have the following assertion: 

t < aVEx~^ 



X~P 



. i-p 

I 2k 



\ n 



n \ n W n Jn 



3n 



(16) 
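From an easy calculation, we hence get in this case:

$$\delta_\lambda(t) \leq K\left(\frac{\lambda^{-\beta}}{\sqrt{n}}\right)^{\frac{2\kappa}{2\kappa+\rho-1}},$$

where $K > 0$ is a generic constant. We are now in a position to apply Lemma 18 with:

$$\delta = K\left(\frac{\lambda^{-\beta}}{\sqrt{n}}\right)^{\frac{2\kappa}{2\kappa+\rho-1}} \quad\text{and}\quad t' = t + \log\log_q n.$$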



In this case, note that for any $t > 0$ independent of $n$, the choice of $\lambda$ in Theorem 3 warrants
that, for any $n \geq n_0(t)$:

t + log logg n < 5" K A ^/n\~^5^ . 
Moreover, using Lemma 21, we have in dimension $d = 1$:



[R - R^){g - g*) < CX'^ + ;-{R{g) - R{g*)). 



As a result condition (|12p of Lemma[T8]is satisfied with r = 1/2 and a(A) = A^'^. We can 
also check that for n great enough, the choice of A in Theorem [3] guarantees: 



A^' < K 



X-l^\ 2K+P-1 



Finally, we get the result since: 



log, ' < 



log 



J logg(n] 



< e" 



For the d-dimensional case, we have the same algebra by replacing A ^ by Yl'^^iX, in 



the previous calculus and A^'^ by y]°'_i A^^'' thanks to Lemma [2T1 The choice of A,-, for 
j = 1, . . . , d in Theorem [3] allows to conclude. 
Case 2: $|\mathcal{G}(0)| \geq 2$.

When the infimum is not unique, the diameter $\sigma_\lambda(\delta)$ does not necessarily tend to zero when
$\delta \to 0$. We hence introduce the more sophisticated geometric parameter:



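$$r(\sigma,\delta) = \sup_{g\in\mathcal{G}(\delta)}\inf_{g'\in\mathcal{G}(\sigma)}\sqrt{\mathbb{E}_P(\ell_\lambda(g,Z) - \ell_\lambda(g',Z))^2},\quad \text{for } 0 < \sigma \leq \delta.$$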



It is clear that r{a,6) < y (^x(^^ ^"-"^ — ^ 0, we have r{a,S) — )■ 0. The idea of the 
proof is to use a modified version of Lemma 18 following Koltchinskii (2006, Theorem 4).
More precisely, we have to apply the concentration inequality of Theorem 17 to the random
variable:



Wx{6) = sup sup 



(R^ - R^)ig - g'] 
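This localization guarantees the upper bounds of Theorem 3 when $|\mathcal{G}(0)| \geq 2$. However, to
this end, we have to check (for $d = 1$ for simplicity):

$$\lim_{\epsilon\to 0}\mathbb{E}\sup_{g\in\mathcal{G}(\sigma)}\ \sup_{g'\in\mathcal{G}(\delta):\,\sqrt{\mathbb{E}_P(\ell_\lambda(g,Z)-\ell_\lambda(g',Z))^2}\leq r(\sigma,\delta)+\epsilon}\left|(R_n^\lambda - R^\lambda)(g - g')\right| \leq C\,\frac{\lambda^{-\beta}}{\sqrt{n}}\,\delta^{\frac{1-\rho}{2\kappa}}, \qquad (17)$$

and for $0 < \sigma \leq \delta$:

$$r(\sigma,\delta) \leq C\lambda^{-\beta}\delta^{1/2\kappa}. \qquad (18)$$

Using MA($\kappa$) and Lemma 1 in Loustau (2012), it is clear that (17) holds since:
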



E sup sup 

g^Gi^) g'&g{5):^Ep{ex{g,Z)-e^{g',Z)y-'<r{a,S)+ 



{R'^-R^){g-g'] 



< E sup 

(;eg(f7),g*eg(0) 



< 2E sup 

(9,5*)ee(5)xg(o) 



+ E sup 



{R^^-R^){g' -g*{g')) 



{R^-R^){g*-g) 



n 



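To check (18), note that with MA($\kappa$) and the first assertion of Lemma 20, we have, for
$g \in \mathcal{G}(\delta)$, $g' \in \mathcal{G}(\sigma)$:

$$\sqrt{\mathbb{E}_P(\ell_\lambda(g,Z) - \ell_\lambda(g',Z))^2} \leq C\lambda^{-\beta}\|\ell(g) - \ell(g')\|_{L_2}$$
$$\leq C\lambda^{-\beta}\delta^{1/2\kappa} + C\lambda^{-\beta}\|\ell(g^*(g)) - \ell(g^*(g'))\|_{L_2},$$

for $0 < \sigma \leq \delta$. Taking the infimum with respect to $g' \in \mathcal{G}(\sigma)$, we get:

$$\|\ell(g^*(g)) - \ell(g^*(g'))\|_{L_2} = 0.$$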



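6.2.2 Proof of Theorem 10

The main ingredient of the proof is Lemma 19. We want to find a convenient bound for the
term (see the notations of Lemma 19):

$$U'_\lambda(\delta,t) = K\left[\mathbb{E}Z'_\lambda(\delta) + \sqrt{\frac{t}{n}}\,\sigma'_\lambda(\delta) + \sqrt{\frac{t}{n}\,(1 + 2b'_\lambda(\delta))\,\mathbb{E}Z'_\lambda(\delta)} + \frac{t}{3n}\right].$$

First note that since $\ell(g,\cdot)$ is bounded, we have the crude bound $\mathbb{E}_P\ell(g,X)^2 \leq MR(g)$,
where $M = \|\ell(g,\cdot)\|_\infty$. Hence, we have, using the entropy condition:

$$\mathbb{E}Z'_\lambda(\delta) = \mathbb{E}\sup_{g\in\mathcal{G}'(\delta)}\left|(R_n^\lambda - R^\lambda)(g)\right| \leq \mathbb{E}\sup_{\|\ell(g)\|_{L_2(P)}\leq\sqrt{M}\delta^{1/2}}\left|(R_n^\lambda - R^\lambda)(g)\right| \leq C\,\frac{\lambda^{-\beta}}{\sqrt{n}}\,\delta^{\frac{1-\rho}{2}},$$

where we use in the last line Lemma 1 in Loustau (2012).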
We obtain: 



■n \ n \ n ^ ^ ' Jn 3n 



V'xm < K 



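Now, from Lemma 23, we have the following control of $\sigma'_\lambda(\delta)$:

$$\sigma'_\lambda(\delta) = \sup_{g\in\mathcal{G}'(\delta)}\sqrt{\mathbb{E}_P\ell_\lambda(g,Z)^2} \leq C\lambda^{-\beta}\sqrt{\mathbb{E}_P\ell(g,X)^2} \leq C\lambda^{-\beta}\sqrt{\delta},$$

where $C > 0$ is a generic constant and where we use in the last inequality the boundedness
assumption of $\ell(g,\cdot)$. Now, by the second assertion of Lemma 20:

$$b'_\lambda(\delta) = \sup_{g\in\mathcal{G}'(\delta)}\|\ell_\lambda(g,\cdot)\|_\infty \leq C\lambda^{-\beta-1/2}.$$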



It follows that: 



U[i5,t)<K 



S 2 + W-A '^(52 + 
n \ n 



t t , ^ „ ^ X~f^ i-P t 



,(19) 



We hence have in this case the following assertion: 



1- A^''^ 1- 

t < A n6'P A V^6^ UU5, t) < . 

\ n 



From an easy calculation, we hence get, with the notations of Lemma 19:

$$\delta'_\lambda(t) \leq K\left(\frac{\lambda^{-\beta}}{\sqrt{n}}\right)^{\frac{2}{1+\rho}}, \qquad (20)$$

where $K > 0$ is a generic constant. Let us consider, for any $\epsilon > 0$:



KV2C 



+ (l + e) inf 



where $(r_\epsilon, a_\epsilon, u_\epsilon) \in (0,1)^2 \times (0,1/q)$ are chosen later on as a function of $\epsilon > 0$. Using Lemma
24, we have in dimension $d = 1$, for any $r \in (0,1)$:

ygeg, <-X^' + rR{g). 

r 

As a result, condition ()13p of Lemma [T9l is satisfied with a*(r. A) = CX?^ /r. The choice of 
A in Theorem 1101 warrants that: 



A2^< 



X-P\ 1+P 



(21) 
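Moreover, for any $\epsilon > 0$, we can find a triplet $(r_\epsilon, a_\epsilon, u_\epsilon) \in (0,1)^2 \times (0,1/q)$ such that:

$$1 + \epsilon = \frac{1 + r_\epsilon}{(1 - r_\epsilon)\,u_\epsilon\,(1 - a_\epsilon)}. \qquad (22)$$

Inequalities (20), (21) and (22) give us:

$$\delta \geq \max\left(\delta'_\lambda(t),\ \frac{1 + r_\epsilon}{(1 - r_\epsilon)\,u_\epsilon\,(1 - a_\epsilon)}\inf_{g\in\mathcal{G}} R(g),\ \frac{2}{(1 - r_\epsilon)\,a_\epsilon\,u_\epsilon}\,a^*(r_\epsilon,\lambda)\right).$$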



Finally, we can apply Lemma 19 with the triplet $(r_\epsilon, a_\epsilon, u_\epsilon)$ and $t' = t + \log\log_q n$, and get the
result since:



1 



logg *' < - , -6 \ \-/3 / 1 

1 + P V^ / iogq n 



log 



6.3 Proof of Theorem 14

The proof of Theorem 14 uses a slightly different version of Theorem 3. First of all, an
inspection of the proof of Theorem 3 shows that condition (9) in Theorem 3 can be replaced
by the following control of the local complexity of the noisy empirical process:



E sup 

9,9'eg(<5) 



{R>^-R^){g-g') 



'n 



(23) 



Hence, using Lemma 25 in the Appendix, together with condition (PRC), we can have
(23) with $\rho = 0$.

However, the case $\rho = 0$ is not treated in Theorem 3, where $\rho \in (0,1)$. From (23), and using
the notations of Lemma 18, (16) in the proof of Theorem 3 becomes:



Ux{5,t)<K 



We hence have the following assertion: 



■5-2 + Vt^5-2 + W - (1 + A-/5-1/2) ^2 + — 

"n \ n W n \ n in 



t < Ux{S, t) < K (l + Vi) . 



n 



Using the same algebra as above, we can use Lemma 18 with:



5 = K(l + Vt' 



and t' = t + log logg n. 



In this case, note that the choice of $t' = t + \log\log_q n$ gives rise to the following asymptotic:



5 \/log log n — —52 . 

'n 



and leads to an extra $\sqrt{\log\log n}$ term in the rates of convergence.
7. Appendix

7.1 Technical lemmas for the exact case 

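Lemma 20 Suppose (NA) holds, and $\mathcal{K}$ satisfies assumption (K1). Suppose $\|f * \eta\|_\infty \leq C_\infty$
and $\sup_{g\in\mathcal{G}}\|\ell(g,\cdot)\|_{L_2(K)} < \infty$. Then, the two following assertions hold:

(i) $\ell(g) \mapsto \ell_\lambda(g)$ is Lipschitz with respect to $\lambda$:

$$\forall g, g' \in \mathcal{G},\quad \|\ell_\lambda(g,\cdot) - \ell_\lambda(g',\cdot)\|_{L_2(P)} \leq C_1\prod_{i=1}^d\lambda_i^{-\beta_i}\,\|\ell(g,\cdot) - \ell(g',\cdot)\|_{L_2},$$

where $C_1 > 0$ is a generic constant which depends on $C_\infty$ and constants in (K1).

(ii) $\{\ell_\lambda(g), g \in \mathcal{G}\}$ is uniformly bounded:

$$\sup_{g\in\mathcal{G}}\|\ell_\lambda(g,\cdot)\|_\infty \leq C_2\prod_{i=1}^d\lambda_i^{-\beta_i-1/2},$$

where $C_2 > 0$ is a generic constant which depends on constants in (K1).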
Proof Using Plancherel and the boundedness assumption over $f * \eta$, we have:



Ep{h{g,Z)-ex{g',Z)f 



j)Cr,{j)*{lKx{£ig,.)-£ig',.)){z) 



f * r){z)dz 



< C I _|^[/C,(-)](t)|2|^[ Ik X i£{g, •) - £{g', •))](t)pcit 



< CX-'^\\eig)-£{g')\\l, 
where we use in last line the following inequalities: 



-2 |^[/C,(./A)](s)|2 = \T[]Cr,]{sX)f < Csup 



j^mitx) 



< C sup 



provided that (K1) holds.

In the same way, the second assertion holds since, if $\ell(g,\cdot) \in L_2(K)$:



\hig,z)\ < 



K 



1 



z — X 



i{g,x) 



dx 



< CA-'^-V^. 



z — X 
X 



dx 



A straightforward generalization leads to the d-dimensional case. 
Lemma 21 Suppose $f$ belongs to the anisotropic Hölder space $\mathcal{H}(s,L)$ with $s = (s_1, \ldots, s_d)$.
Let $\mathcal{K}$ be a kernel satisfying assumption K($m$) with $m = \lfloor s \rfloor \in \mathbb{N}^d$. Suppose MA($\kappa$) holds
with parameter $\kappa \geq 1$. Then, we have:

$$(R - R^\lambda)(g - g^*(g)) \leq C\sum_{j=1}^d\lambda_j^{\frac{2\kappa s_j}{2\kappa-1}} + \frac{1}{2}\Big(R(g) - \inf_{g\in\mathcal{G}} R(g)\Big),$$

where $C > 0$ is a generic constant.



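Proof Note that we can write:

$$(R^\lambda - R)(g - g^*) = \int_K\big(\ell(g,x) - \ell(g^*,x)\big)\big(\mathbb{E}\hat f_\lambda(x) - f(x)\big)dx,$$

where we omit the notation $g^* = g^*(g)$ for simplicity. The first part of the proof uses
Proposition 1 stated in Comte and Lacour (2012).

Proposition 22 (Comte and Lacour (2012)) Let $B_0(\lambda) = \sup_{x_0\in\mathbb{R}^d}|f(x_0) - \mathbb{E}\hat f_\lambda(x_0)|$.
Then, if $f$ belongs to the anisotropic Hölder space $\mathcal{H}(s,L)$, and $\mathcal{K}$ is a kernel of order $\lfloor s \rfloor$,
we have:

$$B_0(\lambda) \leq C\sum_{j=1}^d\lambda_j^{s_j},$$

where $C > 0$ denotes some generic constant.

The rest of the proof uses the margin assumption MA($\kappa$) as follows:

$$(R^\lambda - R)(g - g^*) \leq C\sum_{j=1}^d\lambda_j^{s_j}\int_K|\ell(g,x) - \ell(g^*,x)|\,dx$$
$$\leq C\sum_{j=1}^d\lambda_j^{s_j}\sqrt{\int_K|\ell(g,x) - \ell(g^*,x)|^2dx}$$
$$\leq C\sum_{j=1}^d\lambda_j^{s_j}\big(R(g) - R(g^*)\big)^{\frac{1}{2\kappa}}$$
$$\leq C\sum_{j=1}^d\lambda_j^{\frac{2\kappa s_j}{2\kappa-1}} + \frac{1}{2}\Big(R(g) - \inf_{g\in\mathcal{G}} R(g)\Big),$$

where we use in the last line Young's inequality:

$$xy^r \leq ry + x^{\frac{1}{1-r}},\quad \forall r < 1,$$

with $r = \frac{1}{2\kappa}$.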
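
As a sanity check on the form of Young's inequality used in this proof, the sketch below evaluates the gap $ry + x^{1/(1-r)} - xy^r$ on a grid of nonnegative values, taking $r = 1/(2\kappa)$. This reading of the inequality is an assumption reconstructed from the surrounding argument, and the function names are hypothetical.

```python
def young_gap(x, y, r):
    """Return r*y + x**(1/(1-r)) - x*y**r, expected to be >= 0 for x, y >= 0 and 0 < r < 1."""
    return r * y + x ** (1.0 / (1.0 - r)) - x * y ** r


def min_young_gap(kappa, grid):
    """Smallest gap over the grid when r = 1/(2*kappa), as in the margin-based argument."""
    r = 1.0 / (2.0 * kappa)
    return min(young_gap(x, y, r) for x in grid for y in grid)


if __name__ == "__main__":
    grid = [i / 10 for i in range(51)]  # values 0.0, 0.1, ..., 5.0
    for kappa in (1.0, 1.5, 2.0, 5.0):
        print(kappa, min_young_gap(kappa, grid))
```

With $x = C\sum_j\lambda_j^{s_j}$ and $y$ the excess risk, the exponent $1/(1-r) = 2\kappa/(2\kappa-1)$ is what turns the bias term into the power of $\lambda$ appearing in the lemma.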
7.2 Technical lemmas for the non-exact case

Lemma 23 Suppose (NA) and DA($c_0$) hold, and $\mathcal{K}$ satisfies assumption (K1). Suppose
$\|f * \eta\|_\infty \leq C_\infty$ and $\sup_{g\in\mathcal{G}}\|\ell(g,\cdot)\|_{L_2(K)} < \infty$. Then, we have:



where $C > 0$ is a generic constant which depends on $c_0$, $C_\infty$ and constants in (K1).
Proof Using Plancherel and the boundedness assumption over $f * \eta$, we have as above:



Epix{g,Zf 



-/C,(-)* lKxi{g,-)(z) 



Jk 

\-2/3 f- 

< C / \iig,z)\^fiz)dz 

Co Jk 

< CX-^'^Pi{g,Xf, 



f * rj{z)dz 



where we use in the third line assumption DA(co). 



Lemma 24 Suppose $f$ belongs to the anisotropic Hölder space $\mathcal{H}(s,L)$ with $s = (s_1, \ldots, s_d)$.
Let $\mathcal{K}$ be a kernel satisfying assumption K($m$) with $m = \lfloor s \rfloor$. Then, we have, for any $r > 0$:

$$|R(g) - R^\lambda(g)| \leq \frac{C}{r}\sum_{j=1}^d\lambda_j^{2s_j} + rR(g),$$

where $C > 0$ is a generic constant which does not depend on $r > 0$.
Proof We follow the first part of the proof of Lemma [2T] to get: 

d „ 

rH9)-R{9)\ < cJ2^7 \e{g,x)\dx. 



Now using DA(co), we have, for any r > 0: 



R\g)-R{g)\ < C^^xy ^ J^\£{g,x)\^dx 



C A'^ , 



Co 

< CY,\;^{R{g)f^ 

i=i 
2r



where we use in the last line Young's inequality:

$$xy^a \leq ay + x^{\frac{1}{1-a}},\quad \forall a < 1,$$

with $a = \frac{1}{2}$.



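7.3 Technical lemma for Theorem 14

Lemma 25 Suppose (PRC), (NA) and the kernel assumption (K1) are satisfied and
$\|X\|_\infty \leq M$. Suppose $\mathbb{E}\|\epsilon\|^2 < \infty$. Then:

$$\mathbb{E}\sup_{(c,c^*)\in\mathcal{C}\times\mathcal{M},\ \|c-c^*\|^2\leq\delta}\left|(R_n^\lambda - R^\lambda)(c^* - c)\right| \leq C\,\frac{\prod_{i=1}^d\lambda_i^{-\beta_i}}{\sqrt{n}}\,\sqrt{\delta},$$

where $C > 0$ is a positive constant.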



Proof The proof follows Levrard (2012), applied to the noisy setting. First note that in
the sequel, we need to introduce the following notation:

1 " 

{Pn - P)(7a(c, Z) - 7a(c', Z):=-Y^ [7a(c, Zi) - 7a(c', Z^)] - Ep [jx{c, Z) - 7a(c', Z)] . 



i=l 



By smoothness assumptions over c i— )• min ||x — Cj||, for any c G W^^ and c* € 7W, we have: 

7a(c,z) - 7a(c*,z) = (c - c*, Vc7a(c*,z)) + ||c - c*\\Rx{c*,c- c*,z), 
where, with IPollardI (1 19821 ) we have: 



Ve7A(c*,z) = -2( / -ICr, 



z — X 



c\)lv*{x)dx, . 



Z — X 



{x- cl)lvAx)dx 



and R\{c* , c — c* , z) satisfies: 

|i?A(c*,c - c*,z)| < ||c - c*||"-^ ^|(c - c*, Vc7a(c*,2;))| + niax^(|||2; - Cj|| - ||x - c*||^ . 
Splitting the expectation in two parts, we obtain: 

E sup |P„-P|(7a(c*,.)-7a(c,.)) <E sup |P„ - P|(c* - c, Vc7a(c*, .)) 

c*eA^,||c-c*|p<(5 c*gA4,||c-c*|P<5 

+ y^E sup |P„-P|(-i2A(c*,c-c*,.)) (24) 

c*eX,l|c-c*||2<<5 
To bound the first term in this decomposition, consider the random variable

k d n 



Z„ = (P„-P)(c*-C,Ve7A(c*,.)> = ^EE(^«.J-<^)E / 

u=l j=l i=l •'^^ 

By a simple Hoeffding's inequality, $Z_n$ is a subgaussian random variable. Its variance can
be bounded as follows:



Zi — X 



A 



varZr, 



- E E(^«.. - ^ J 



j ^u^j ) dx 



1 



< -5E 
n 



< C-(5 / 

n 



n 



Z — X 



^x j ^u'^ J ) dx 



|j"[(7rj - c„+j)lv;+](t)|^dt 



^x j J ) dx 



where w"*" = argmax^j fy^ ^/C^ ("^X^) i^j ~ '^u,j)dx and iXj : x ^ xj, and where we use the 
same argument as in Lemma [20l unde r assump t ion (K l). We hence have using for instance 
a maximal inequality due to Massart iMassarti ( 2007 . Part 6.1): 



E sup (P„-P)(c*-c,Vc7a(c*,.)) <C 

\c*eyVl,||c-c*||2<5 / 



-V5. 



n 



We obtain for the first term in (f 
is smaller, note that from .PoUard (|l982l ). we have: 



the right order. To prove that the second term in ([2 



\Rx{c*,c-c\z)\ < ||c-c*||-M (c-c*,Vc7a(c*,z)) + max (|||z-c 



< ||Ve7A(c*,z)|| + ||c-c*|ri Yl I 

j=i,...k 

< C{Ut,X-^' + \\z\\) 
we we use in last line: 



\Z — Ci 



\z — c„- 



* l|2 



Z — C 



* l|2| 



|Vc7a(c*,z) 



z — x 



{xj-ci^^)iv:{x)dx] <cnu\ 



-213, 



Hence it is possible to apply a chaining argument as in Levrard (2012) to the class

$$\mathcal{F} = \{R_\lambda(c^*, c - c^*, \cdot),\ c^* \in \mathcal{M},\ c \in \mathbb{R}^{dk} : \|c - c^*\| \leq \sqrt{\delta}\},$$

which has an envelope function $F(\cdot) \leq C\big(\prod_{i=1}^d\lambda_i^{-\beta_i} + \|\cdot\|\big) \in L_2(P)$ provided that
$\mathbb{E}\|\epsilon\|^2 < \infty$. We arrive at the conclusion. ■